This analysis performs sample classification using high-dimensional data. It includes two main steps:

The first step uses t-SNE (t-Distributed Stochastic Neighbor Embedding) to create a 2D or 3D map from high-dimensional data. The Barnes-Hut version of t-SNE is used to reduce computing time, so that it can be run multiple times. The result of the t-SNE run with the lowest Kullback–Leibler (KL) divergence is used for the next step.

The second step uses the DBSCAN method to cluster samples based on the t-SNE result from the first step. The two key parameters of DBSCAN, the size of the epsilon neighborhood and the minimum cluster size, are adjusted based on the silhouette score to obtain the optimal classification.

Additionally, the result of unsupervised clustering is compared to any known sample features to evaluate their association.

 

1 Description

1.1 Project

Comparison between cell lines from 9 different cancer tissues (NCI-60); GSE5949

1.2 PubMed

Reinhold WC, Reimers MA, Lorenzi P, Ho J et al. Multifactorial regulation of E-cadherin expression: an integrative study. Mol Cancer Ther 2010 Jan;9(1):1-16. PMID: 20053763.

1.3 Experimental design

Comparison between cell lines from 9 different cancer tissue-of-origin types (Breast, Central Nervous System, Colon, Leukemia, Melanoma, Non-Small Cell Lung, Ovarian, Prostate, Renal) from the NCI-60 panel.

2 Results

2.1 t-SNE

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality reduction method that maps high-dimensional data to a 2- or 3-dimensional space. This analysis applies the Barnes-Hut version to a data matrix with 60 samples and 17647 features. The method is implemented by the Rtsne {Rtsne} R function.

On a t-SNE map, the overall KL (Kullback–Leibler) divergence measures how well the similarities between samples in the original space are preserved. An optimal t-SNE run should have a relatively low KL divergence. Each t-SNE run of this analysis involves the following sub-steps:

  • Run an initial PCA to reduce the number of features to 50;
  • Calculate the Euclidean distance between any two samples to get a similarity score;
  • Fit the distance of each sample to its 2 nearest neighbors to a Cauchy distribution on a 2-dimensional space;
  • Use the Barnes-Hut approximation to reduce computational complexity, with theta = 0.5;
  • Adjust the position of samples in the low-dimensional space for up to 1000 iterations to lower the overall KL divergence.

The steps above are repeated 100 times, and the output of the run with the lowest KL divergence is the final t-SNE result.
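The repeated-run procedure above can be sketched as follows. This is a minimal illustration and not the report's actual code: the function name `best_tsne`, the simulated matrix `expr`, and `perplexity = 5` are assumptions (the report does not state its perplexity setting; Rtsne only requires that 3 * perplexity be smaller than the number of samples minus 1). The remaining arguments mirror the sub-steps listed above.

```r
# Sketch only: repeat Barnes-Hut t-SNE and keep the run with the lowest KL divergence.
# best_tsne(), expr and perplexity = 5 are illustrative assumptions.
library(Rtsne)

best_tsne <- function(expr, n_runs = 100, perplexity = 5, seed = 1) {
  set.seed(seed)  # one seed up front; each run then starts from a different random layout
  runs <- lapply(seq_len(n_runs), function(i) {
    Rtsne(expr, dims = 2,
          pca = TRUE, initial_dims = 50,   # initial PCA down to 50 features
          perplexity = perplexity,
          theta = 0.5,                     # Barnes-Hut approximation
          max_iter = 1000,
          check_duplicates = FALSE)
  })
  # Rtsne returns one cost per sample; their sum is the KL divergence of the map
  kl <- vapply(runs, function(r) sum(r$costs), numeric(1))
  list(best = runs[[which.min(kl)]], kl = kl)
}

# Usage on simulated data (a stand-in for the real 60 x 17647 matrix)
set.seed(1)
expr <- matrix(rnorm(60 * 200), nrow = 60)
res <- best_tsne(expr, n_runs = 3)
dim(res$best$Y)   # coordinates of the selected map, one row per sample
```

Here `res$kl` holds the KL divergence of every run, which is the kind of summary shown in Figure 1, and `res$best$Y` would be the input to the clustering step.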

Figure 1. The t-SNE program is run 100 times, and the final Kullback–Leibler divergence, or total cost, of the maps generated by all runs is summarized in this figure. Click here to download the full table of KL divergence.
Figure 2. t-SNE runs with the highest and the lowest KL divergence. The run with the lowest KL divergence is picked as the final t-SNE result and used as the input to DBSCAN classification.
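
For reference, the KL divergence summarized in the figures above is, in the standard formulation of t-SNE, the cost that compares the pairwise similarities p_ij of samples in the original space with the similarities q_ij of their positions y_i on the map. The notation below follows the usual t-SNE definition and is not taken from this report:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

A lower value of C means that the map distorts the original neighborhood structure less, which is why the run with the lowest value is kept.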

2.2 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a sample clustering method based on the density of samples in a multi-dimensional space. This analysis uses the dbscan {dbscan} R implementation of this method. Its input is the result of the t-SNE run with the lowest KL divergence, and its output is evaluated by the silhouette score of the clustered samples. Each DBSCAN run uses the following parameters:

  • eps (required): size of the epsilon neighborhood. This analysis uses an iterative procedure to identify the eps value that maximizes the silhouette score of the clustered samples.
  • minPts (required): minimum number of core points (samples) to form a cluster. This analysis tests minPts values from 2 to 12 to identify the eps/minPts combination with the highest silhouette score.
  • search: nearest neighbor search strategy = “kdtree”.
  • bucketSize: maximum size of the kd-tree leaves = 10.
  • splitRule: rule to split the kd-tree = “suggest”.
  • approx: relative error bound for approximate nearest neighbor searching = 0.
Figure 3. The average silhouette score for each value of the minimum cluster size. The minimum cluster size with the highest score is highlighted in red.

The optimal value of minPts (minimal number of samples in a cluster) is 2. The corresponding clustering result of DBSCAN will be used for the rest of this analysis.
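
The parameter search described above can be sketched as below. This is an illustration under stated assumptions, not the report's implementation: the helper names `score_dbscan` and `tune_dbscan` are hypothetical, a plain grid of eps values stands in for the report's iterative procedure (which is not described in detail), and leaving noise points (cluster 0) out of the silhouette calculation is a choice made for this sketch.

```r
# Sketch only: choose eps/minPts by the mean silhouette width of the DBSCAN clusters.
# score_dbscan() and tune_dbscan() are hypothetical helper names.
library(dbscan)   # dbscan()
library(cluster)  # silhouette()

# Mean silhouette width of one DBSCAN run. Noise points (cluster 0) are left out,
# and NA is returned when the silhouette is undefined (fewer than 2 clusters).
score_dbscan <- function(xy, eps, minPts) {
  cl <- dbscan(xy, eps = eps, minPts = minPts)$cluster
  keep <- cl > 0
  k <- length(unique(cl[keep]))
  if (k < 2 || k >= sum(keep)) return(NA_real_)
  labels <- as.integer(factor(cl[keep]))          # recode clusters to 1..k
  sil <- silhouette(labels, dist(xy[keep, , drop = FALSE]))
  mean(sil[, "sil_width"])
}

# Evaluate every eps/minPts combination and return the best-scoring row
tune_dbscan <- function(xy, eps_grid, minPts_grid = 2:12) {
  grid <- expand.grid(eps = eps_grid, minPts = minPts_grid)
  grid$score <- mapply(function(e, m) score_dbscan(xy, e, m),
                       grid$eps, grid$minPts)
  grid[which.max(grid$score), ]
}

# Usage on three simulated, well-separated groups of 20 points each
set.seed(1)
xy <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
            matrix(rnorm(40, 5, 0.3), ncol = 2),
            matrix(rnorm(40, 10, 0.3), ncol = 2))
best <- tune_dbscan(xy, eps_grid = seq(0.2, 2, by = 0.2))
best
```

In a real run, `xy` would be the coordinates of the selected t-SNE map, and the eps grid would need to be scaled to the range of those coordinates.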

Table 1 Summarization of clusters identified by DBSCAN using t-SNE output. Click here to download the full table of classification results.
Cluster_ID  N   Neighbors  Width_Mean  Width_Median  Width_Min  Width_1stQu  Width_3rdQu  Width_Max
Cluster_0   0              NaN         NA            NA         NA           NA           NA
Cluster_1   2   4          0.86        0.86          0.85       0.85         0.86         0.87
Cluster_2   15  4;5;9;11   0.37        0.43          -0.13      0.30         0.51         0.58
Cluster_3   3   10         0.94        0.95          0.93       0.94         0.95         0.95
Cluster_4   5   1          0.63        0.68          0.41       0.64         0.68         0.74
Cluster_5   12  1;2        0.67        0.70          0.41       0.64         0.74         0.77
Cluster_6   6   1          0.80        0.82          0.71       0.76         0.84         0.84
Cluster_7   4   8          0.67        0.70          0.54       0.65         0.72         0.75
Cluster_8   5   7          0.63        0.68          0.48       0.56         0.71         0.73
Cluster_9   4   2;11       0.76        0.78          0.66       0.72         0.82         0.82
Cluster_10  2   2;5        0.96        0.96          0.95       0.95         0.96         0.96
Cluster_11  2   9          0.91        0.91          0.91       0.91         0.91         0.91
Figure 4. Samples are clustered via DBSCAN using the results from the t-SNE run with the lowest KL divergence. Samples in the same cluster have the same color. Cluster_0 is made up of samples that cannot be clustered.

2.3 Cluster-feature association

If the samples have been previously labeled with their known features, such as genotype, treatment and disease state, the agreement between these features and DBSCAN classification can be evaluated to discover potential feature-classification association.

Table 2 Summary of cluster-feature association.

  • Num_Group: number of sample groups according to the feature;
  • Rand: Rand index; C_Rand: corrected Rand index;
  • ChiSq: Chi-square test statistic;
  • P_ChiSq: p value of the Chi-square test;
  • Count_Expected: table with expected counts within each cell of the contingency table;
  • Count_Observed: table with observed counts within each cell of the contingency table;
  • Obs/Exp: table with ratios of observed over expected counts.
Feature     Num_Group  Rand  C_Rand  ChiSq     P_ChiSq  Count_Expected  Count_Observed  Obs/Exp
Organ       9          0.86  0.350   253.8560  0.0001   Table           Table           Table
Sex         3          0.58  0.039   32.9116   0.0320   Table           Table           Table
p53_Status  3          0.54  0.013   22.2591   0.3200   Table           Table           Table
Figure 5. The contingency table of the clusters and the known sample feature that has the most significant association based on the Chi-square test. The numbers are the counts of samples at each intersection. The colors indicate the ratios of observed over expected sample counts.
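
The statistics in Table 2 can be illustrated with base R alone. The sketch below is not the report's code: `rand_indices` and `associate` are hypothetical helper names, the corrected Rand index is computed with the usual Hubert-Arabie adjustment, and the simulated (Monte Carlo) p value is an assumption made here because the contingency tables are sparse; the report does not say how its p values were obtained.

```r
# Sketch only: agreement between cluster labels and a known sample feature.
# rand_indices() and associate() are hypothetical helper names.

# Rand index and corrected (adjusted) Rand index of two label vectors
rand_indices <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  pairs <- function(x) sum(x * (x - 1) / 2)   # sample pairs within each count
  same_both <- pairs(tab)                     # pairs grouped together by both labelings
  same_a <- pairs(rowSums(tab))               # pairs grouped together by a
  same_b <- pairs(colSums(tab))               # pairs grouped together by b
  total <- n * (n - 1) / 2
  rand <- (total + 2 * same_both - same_a - same_b) / total
  expected <- same_a * same_b / total
  c_rand <- (same_both - expected) / ((same_a + same_b) / 2 - expected)
  c(Rand = rand, C_Rand = c_rand)
}

# Chi-square test plus the observed, expected and observed/expected tables
associate <- function(cluster, feature, B = 9999) {
  tab <- table(cluster, feature)
  # Assumption: Monte Carlo p value, since many cells have small expected counts
  test <- chisq.test(tab, simulate.p.value = TRUE, B = B)
  list(indices = rand_indices(cluster, feature),
       ChiSq = unname(test$statistic),
       P_ChiSq = test$p.value,
       Count_Observed = test$observed,
       Count_Expected = test$expected,
       Obs_Exp = test$observed / test$expected)
}

# Usage on toy labels in which the clusters match the feature exactly
cluster <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
organ   <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
rand_indices(cluster, organ)   # identical partitions: Rand = 1 and C_Rand = 1
set.seed(1)
res <- associate(cluster, organ, B = 999)
res$Obs_Exp
```

The `Obs_Exp` matrix corresponds to the color scale described for Figure 5: values above 1 mark cluster/feature combinations that occur more often than expected under independence.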

3 Appendix

Check out the RoCA home page for more information.

3.1 Reproduce this report

To reproduce this report:

  1. Find the data analysis template you want to use and an example of its paired YAML file here, and download the YAML example to your working directory.

  2. To generate a new report using your own input data and parameters, edit the following items in the YAML file:

    • output : where you want to put the output files
    • home : the URL of your project's home page, if it has one
    • analyst : your name
    • description : background information about your project, analysis, etc.
    • input : where your input data are located; read the instructions for preparing them
    • parameter : parameters for this analysis; read the instructions about how to prepare input data
  3. Run the code below within R Console or RStudio, preferably in a new R session:

if (!require(devtools)) { install.packages('devtools'); require(devtools); }
if (!require(RCurl)) { install.packages('RCurl'); require(RCurl); }
if (!require(RoCA)) { install_github('zhezhangsh/RoCAR'); require(RoCA); }

CreateReport(filename.yaml);  # filename.yaml is the YAML file you just downloaded and edited

If the code runs without errors, go to the output folder and open the index.html file to view the report.

3.2 Session information

## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RoCA_0.0.0.9000     flexclust_1.3-4     modeltools_0.2-21  
##  [4] lattice_0.20-33     cluster_2.0.4       e1071_1.6-7        
##  [7] gplots_3.0.1        dbscan_0.9-8        Rtsne_0.11         
## [10] webshot_0.3.2       plotly_4.5.2        ggplot2_2.1.0      
## [13] htmlwidgets_0.7     DT_0.2              awsomics_0.0.0.9000
## [16] yaml_2.1.13         rmarkdown_1.0       knitr_1.14         
## [19] RCurl_1.95-4.8      bitops_1.0-6        devtools_1.12.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6        git2r_0.15.0       highr_0.6         
##  [4] RColorBrewer_1.1-2 formatR_1.4        plyr_1.8.4        
##  [7] class_7.3-14       base64enc_0.1-3    tools_3.2.2       
## [10] digest_0.6.10      jsonlite_1.0       evaluate_0.9      
## [13] memoise_1.0.0      tibble_1.1         gtable_0.2.0      
## [16] viridisLite_0.1.3  DBI_0.4-1          curl_1.2          
## [19] parallel_3.2.2     withr_1.0.2        stringr_1.0.0     
## [22] dplyr_0.5.0        httr_1.2.1         caTools_1.17.1    
## [25] gtools_3.5.0       R6_2.1.2           gdata_2.17.0      
## [28] purrr_0.2.2        tidyr_0.5.1        magrittr_1.5      
## [31] scales_0.4.0       htmltools_0.3.5    assertthat_0.1    
## [34] colorspace_1.2-6   KernSmooth_2.23-15 stringi_1.1.1     
## [37] lazyeval_0.2.0     munsell_0.4.3
